
USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection#6471

Open
copejon wants to merge 4 commits into openshift:main from copejon:fix-USHIFT-6401

Conversation

@copejon
Contributor

@copejon copejon commented Apr 7, 2026

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119
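The circular wait is easy to reproduce in miniature. Below is a self-contained Go sketch with hypothetical names (not the vendored hook code): the "hook" goroutine blocks until readiness is signaled, while readiness is signaled only once the hook finishes, so without a timeout neither side can progress.

package main

import (
	"fmt"
	"time"
)

func main() {
	hookDone := make(chan struct{})    // closed when the post-start hook returns
	serverReady := make(chan struct{}) // closed when /readyz would report ok

	// The "hook": blocks on readiness, standing in for the untimed
	// context.TODO() API calls described above.
	go func() {
		<-serverReady
		close(hookDone)
	}()

	// The "server": reports ready only after all post-start hooks complete.
	go func() {
		<-hookDone
		close(serverReady)
	}()

	select {
	case <-serverReady:
		fmt.Println("started")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: the hook and readiness are waiting on each other")
	}
}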

Solution

Add a fail-fast deadlock detector (see the sketch after this list) that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has not completed for 15 seconds, declares a deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery
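
A minimal Go sketch of that detector loop follows, using constant values and helper names assumed from the review discussion below; the authoritative implementation lives in pkg/controllers/kube-apiserver.go.

package controllers

import (
	"context"
	"time"

	"k8s.io/client-go/rest"
)

const (
	rbacHookCheckInterval   = 2 * time.Second  // probe cadence (assumed value)
	rbacHookMaxWaitDuration = 30 * time.Second // absolute wall-clock cap
	maxChecks               = 7                // ~15s of confirmed stall (assumed value)
)

// detectRBACHookDeadlock signals on deadlockDetected only after the
// "etcd healthy + hook stuck" predicate is confirmed maxChecks times.
func detectRBACHookDeadlock(ctx context.Context, restClient rest.Interface,
	etcdHealthy func(context.Context) (bool, error), deadlockDetected chan<- struct{}) {
	startTime := time.Now()
	checkCount := 0

	for checkCount < maxChecks {
		// Absolute deadline: never reset by etcd state changes.
		if time.Since(startTime) >= rbacHookMaxWaitDuration {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(rbacHookCheckInterval):
		}

		// Probe the hook with a short per-request timeout so a hung API
		// server cannot block the detector itself.
		probeCtx, cancel := context.WithTimeout(ctx, time.Second)
		var status int
		err := restClient.Get().
			AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
			Do(probeCtx).StatusCode(&status).Error()
		cancel()

		if err == nil && status == 200 {
			return // hook completed; no deadlock
		}
		if err != nil && status == 0 {
			continue // transport error only; don't count toward the deadlock
		}

		// Count only a confirmed "hook stuck while etcd is healthy" observation.
		if healthy, herr := etcdHealthy(ctx); herr == nil && healthy {
			checkCount++
		} else {
			checkCount = 0 // flapping resets the count, not the wall clock
		}
	}
	close(deadlockDetected)
}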

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401), [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit

  • Bug Fixes
    • Improved startup reliability by detecting RBAC bootstrap deadlocks via time-bounded monitoring of the readiness endpoint and etcd health; the detector confirms the condition before acting, reducing false positives, and performs a bounded service-restart attempt to recover initialization.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • Pure MicroShift-level fix without vendored code changes

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but hook not completing for 15 seconds, declares deadlock
  4. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 7, 2026
@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a concurrent RBAC bootstrap-hook deadlock detector to kube-apiserver.Run that probes the poststarthook readiness endpoint and etcd health, signals confirmed deadlocks, attempts recovery by stopping microshift-etcd.scope, and causes Run to return a formatted deadlock error.

Changes

Cohort / File(s) Summary
RBAC Deadlock Detection
pkg/controllers/kube-apiserver.go
Converts kubeAPIStartupTimeout to a duration; adds detectRBACHookDeadlock goroutine and rbacDeadlockDetected channel; implements periodic readiness probes against /readyz/poststarthook/rbac/bootstrap-roles, etcd health checks (isEtcdHealthy using etcd v3 Status over TLS), confirmation logic with max checks and wall-clock cap, and recovery via restartMicroshiftEtcdScope (executes systemctl stop microshift-etcd.scope). Adds bounded contexts and logging. (+182/-3).
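
Based on that summary, the etcd health probe could look roughly like the following; this is a hedged sketch with placeholder endpoint and certificate paths, not MicroShift's actual configuration.

package controllers

import (
	"context"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/transport"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// isEtcdHealthy dials etcd over TLS and issues a short Status request.
// The paths and endpoint below are placeholders, not the real values.
func isEtcdHealthy(ctx context.Context) (bool, error) {
	tlsInfo := transport.TLSInfo{
		CertFile:      "/placeholder/client.crt",
		KeyFile:       "/placeholder/client.key",
		TrustedCAFile: "/placeholder/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		return false, err
	}

	const endpoint = "https://127.0.0.1:2379"
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 2 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	// Bound the Status call so a wedged etcd cannot stall the detector.
	statusCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	if _, err := cli.Status(statusCtx, endpoint); err != nil {
		return false, err
	}
	return true, nil
}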

Sequence Diagram

sequenceDiagram
    participant Run as Run Method
    participant Poll as Readiness Polling
    participant Detector as RBAC Detector
    participant Hook as /readyz/poststarthook/rbac/bootstrap-roles
    participant Etcd as Etcd (v3)
    participant Systemctl as systemctl

    Run->>Poll: start kube-apiserver readiness polling
    Run->>Detector: start RBAC deadlock detector goroutine

    Detector->>Detector: wait initial delay
    loop periodic checks (bounded by confirmation count & max wait)
        Detector->>Hook: GET /readyz/poststarthook/rbac/bootstrap-roles
        Detector->>Etcd: short-timeout Status probe (TLS)
        Etcd-->>Detector: healthy/unhealthy

        alt Etcd healthy & Hook not-ready confirmed
            Detector->>Systemctl: run "systemctl stop microshift-etcd.scope"
            Systemctl-->>Detector: result
            Detector-->>Run: close deadlock channel (signal)
            Run->>Run: log, attempt restart scope, return deadlock error
        else Hook ready or Etcd unhealthy or threshold not reached
            Detector->>Detector: continue polling or reset confirmation
        end
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (11 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main change: adding RBAC bootstrap hook deadlock detection with fail-fast behavior. It references the relevant Jira issues and clearly conveys the primary purpose of the changeset.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names ✅ Passed: The PR modifies a controller implementation file, not test files. No Ginkgo test patterns are present, so the check is not applicable.
  • Test Structure And Quality ✅ Passed: The PR contains only production code modifications; there is no Ginkgo test code to assess.
  • Microshift Test Compatibility ✅ Passed: The PR modifies production code only; no Ginkgo e2e tests are introduced, so the MicroShift test compatibility check is not applicable.
  • Single Node Openshift (Sno) Test Compatibility ✅ Passed: This PR does not add any new Ginkgo e2e tests, so the SNO compatibility check is not applicable.
  • Topology-Aware Scheduling Compatibility ✅ Passed: The PR modifies the kube-apiserver controller with deadlock detection and etcd health monitoring. No scheduling constraints, affinity rules, nodeSelector, topology spread constraints, or PodDisruptionBudgets are introduced. The changes are operational and diagnostic, compatible with all OpenShift topologies.
  • Ote Binary Stdout Contract ✅ Passed: The PR contains no non-JSON stdout writes in process-level code. The init() function performs only a TLS profile lookup with no stdout output, and all logging occurs within method bodies outside the OTE contract scope.
  • Ipv6 And Disconnected Network Test Compatibility ✅ Passed: The PR does not add new Ginkgo e2e tests; changes are exclusively in pkg/controllers/kube-apiserver.go, implementing RBAC deadlock detection logic without test files.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 7, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

542-557: Function name is misleading.

restartMicroshiftEtcdScope only stops the scope; the restart occurs when MicroShift itself restarts after Run() returns an error. Consider renaming to stopMicroshiftEtcdScope for clarity.

♻️ Suggested rename
-// restartMicroshiftEtcdScope restarts the microshift-etcd.scope to recover from deadlock.
-// This forces a clean restart of etcd which can help break the circular dependency.
-func restartMicroshiftEtcdScope() error {
+// stopMicroshiftEtcdScope stops the microshift-etcd.scope to recover from deadlock.
+// MicroShift will restart both etcd and kube-apiserver after Run() returns an error.
+func stopMicroshiftEtcdScope() error {
 	klog.Info("Stopping microshift-etcd.scope for recovery")

Also update the call site at line 438.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 488-496: The current RBAC bootstrap loop resets checkCount to 0
when etcd is unhealthy which allows flapping to prevent ever reaching maxChecks;
modify the logic in the kube-apiserver.go RBAC detection loop (around variables
checkCount, maxChecks, and etcdHealthy) to track a wall-clock deadline (e.g.,
startTime or deadline) when the check begins and do not reset that deadline on
etcd state changes—always compare time.Since(startTime) (or
time.Now().After(deadline)) against a configured maxWaitDuration and fail/exit
the loop if exceeded; keep the existing checkCount logic for deadlock detection
but add this independent timeout check so flapping cannot extend detection
indefinitely.


📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and db029bd.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon
Contributor Author

copejon commented Apr 7, 2026

@coderabbitai help

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 469-472: The loop that decides to signal deadlock is incrementing
checkCount and falling through to close(deadlockDetected) on
rbacHookMaxWaitDuration expiry even when isEtcdHealthy() or the RBAC probe
errored; modify the loop (the block around checkCount, isEtcdHealthy,
rbacHookMaxWaitDuration, maxChecks, and close(deadlockDetected)) so that
checkCount is only incremented when the "healthy-etcd stall" predicate is
actually observed (i.e., when both the RBAC probe is unfinished AND
isEtcdHealthy() returns true), ignore/skip increments when probes or health
checks error, and on hitting rbacHookMaxWaitDuration return early unless
checkCount has already reached maxChecks; ensure close(deadlockDetected) is only
called after the predicate has been confirmed maxChecks times.
- Around line 483-486: The RBAC readiness probe call using
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx)...
uses the long-lived controller ctx and can hang; wrap that request in a short
per-probe context with timeout (e.g., context.WithTimeout(ctx, 2*time.Second)),
use the derived probeCtx in Do(probeCtx), and defer cancel() so the goroutine
unblocks and the 2s interval / 30s max-wait logic can re-evaluate; update the
call that sets status and err (StatusCode(&status).Error()) to use the probe
context.
- Around line 558-563: The call to stopCmd.CombinedOutput() in
restartMicroshiftEtcdScope is unbounded and can hang; wrap the systemctl
invocation with a context deadline (e.g., create a context.WithTimeout and use
exec.CommandContext) and use that context when constructing stopCmd, then read
CombinedOutput (or CombinedOutput on the context-backed command) and return a
timeout-aware error if the context expires; reference
restartMicroshiftEtcdScope, stopCmd and CombinedOutput and ensure you cancel the
context and propagate/annotate context timeout errors in the returned fmt.Errorf
message.
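
A sketch of the bounded invocation that last comment asks for, assuming a 5-second timeout and the stop-only naming suggested in the earlier nitpick; this is not the merged code.

package controllers

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// stopMicroshiftEtcdScope stops the etcd scope with a hard deadline so the
// recovery path cannot stall on a hung systemctl.
func stopMicroshiftEtcdScope() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "systemctl", "stop", "microshift-etcd.scope").CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("systemctl stop microshift-etcd.scope timed out: %s", string(out))
	}
	if err != nil {
		return fmt.Errorf("systemctl stop microshift-etcd.scope failed: %w (output: %s)", err, string(out))
	}
	return nil
}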

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon copejon changed the title USHIFT-6401: Add fail-fast RBAC bootstrap hook deadlock detection USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection Apr 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 8, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.


@copejon copejon marked this pull request as ready for review April 16, 2026 14:12
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 16, 2026
@openshift-ci openshift-ci Bot requested review from eslutsky and pacevedom April 16, 2026 14:13
@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.



@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/controllers/kube-apiserver.go (1)

468-515: ⚠️ Potential issue | 🟠 Major

Wall-clock deadline break still triggers false deadlock signal.

When rbacHookMaxWaitDuration expires at line 472, the loop breaks and falls through to close(deadlockDetected) at line 515—even if etcd was never confirmed healthy. The "etcd healthy + hook stuck" predicate may never have been observed.

Additionally, checkCount++ at line 481 executes before conditions are verified. If isEtcdHealthy errors, the count advances without observing the deadlock predicate.

Proposed fix
+	deadlockPredicateObserved := false
+
 	for checkCount < maxChecks {
 		// Check absolute deadline first - this cannot be reset by etcd state changes
 		if time.Since(startTime) >= rbacHookMaxWaitDuration {
 			klog.Errorf("RBAC bootstrap hook exceeded maximum wait duration of %v", rbacHookMaxWaitDuration)
-			break
+			if !deadlockPredicateObserved {
+				klog.Warning("Wall-clock deadline reached but deadlock predicate never confirmed - not signaling deadlock")
+				return
+			}
+			break // predicate was observed, signal deadlock
 		}
 
 		select {
 		case <-ctx.Done():
 			return
 		case <-time.After(rbacHookCheckInterval * time.Second):
 		}
 
-		checkCount++
-
 		// Check RBAC hook status
 		var status int
 		err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()
 
 		// If hook is ready, no deadlock
 		if err == nil && status == 200 {
 			klog.V(4).Info("RBAC bootstrap hook completed successfully")
 			return
 		}
 
 		// Hook not ready - check if etcd is healthy
 		etcdHealthy, etcdErr := isEtcdHealthy(ctx)
 		if etcdErr != nil {
 			klog.V(4).Infof("Could not check etcd health: %v", etcdErr)
 			continue
 		}
 
 		if etcdHealthy {
+			checkCount++
+			deadlockPredicateObserved = true
 			klog.Warningf("RBAC bootstrap hook not ready (check %d/%d, elapsed %v), but etcd is healthy - potential deadlock",
 				checkCount, maxChecks, time.Since(startTime).Round(time.Second))
 		} else {
 			// etcd not healthy - not a deadlock, just waiting for etcd
 			klog.V(4).Infof("RBAC hook waiting, etcd not yet healthy (check %d/%d)", checkCount, maxChecks)
-			// Reset counter since this isn't a deadlock condition
-			// Note: wall-clock deadline (startTime) is NOT reset - flapping cannot extend indefinitely
 			checkCount = 0
 		}
 	}
🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

566-568: Consider context-aware wait.

The 1s time.Sleep isn't interruptible. Low impact but a select with ctx.Done() would be cleaner for graceful shutdown.
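
A small sketch of the interruptible wait being suggested; the helper name is hypothetical.

package controllers

import (
	"context"
	"time"
)

// sleepOrCancel waits for d, or returns early when ctx is cancelled.
func sleepOrCancel(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // graceful shutdown: stop waiting immediately
	case <-time.After(d):
		return nil
	}
}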


📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon
Contributor Author

copejon commented Apr 16, 2026

/retest

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 59-69: rbac timing constants rbacHookDeadlockTimeout and
rbacHookCheckInterval are plain integers but used as time.Duration; change them
to durations by multiplying by time.Second (rbacHookDeadlockTimeout = 15 *
time.Second and rbacHookCheckInterval = 2 * time.Second), then update the
log/format that prints the timeout to use %s (duration) instead of “%d seconds”,
and remove any places that multiply rbacHookCheckInterval by time.Second (e.g.
the spot that computes maxChecks / or multiplies the interval) so all arithmetic
uses time.Duration consistently.
- Around line 496-515: The current RBAC probe logic treats any non-nil err from
restClient.Get() as a transport error and skips deadlock counting; change the
condition so only requests with no HTTP status code (transport/timeout errors)
are skipped — i.e., update the check around restClient.Get()/probeCtx to
skip/continue only when err != nil AND status == 0 (use the existing status
integer and err from the Do(...).StatusCode(&status).Error() call), while
allowing HTTP responses (even 5xx where status != 0) to be counted toward the
deadlock check and preserve the existing klog message for the skipped case.
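
For reference, a sketch of how the constants would read once they are true durations, with values taken from this thread; call sites then consume them directly instead of multiplying by time.Second.

const (
	rbacHookDeadlockTimeout = 15 * time.Second // was: 15 (int, implicitly "seconds")
	rbacHookCheckInterval   = 2 * time.Second  // was: 2
	rbacHookMaxWaitDuration = 30 * time.Second
)

// Usage passes the durations as-is, e.g.:
//   case <-time.After(rbacHookCheckInterval):
// and logging formats them with %s/%v rather than "%d seconds":
//   klog.Errorf("RBAC bootstrap hook deadlock detected after %s", rbacHookDeadlockTimeout)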

📥 Commits

Reviewing files that changed from the base of the PR and between 7daee5d and 184e1cb.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

copejon and others added 3 commits April 22, 2026 10:20
…tion

On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd
contains existing data. The hook uses context.TODO() for API calls,
which has no timeout. When the loopback client hangs, this creates a
circular dependency where the hook waits for the API server while the
API server waits for the hook to complete.

This change adds a parallel deadlock detector that:
- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks if etcd is healthy while the hook is stuck
- Detects deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking
recovery action at the MicroShift level, without requiring changes to
vendored upstream Kubernetes code.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add rbacHookMaxWaitDuration (30s) as an absolute deadline that cannot
be reset by etcd health state changes. This prevents a flapping etcd
from extending the deadlock detection indefinitely.

The existing checkCount logic is preserved for detecting deadlock when
etcd is consistently healthy, but the wall-clock deadline provides a
hard upper bound regardless of etcd state transitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…itives

- Only increment checkCount when deadlock predicate confirmed
  (RBAC not ready AND etcd healthy)
- Skip counting when RBAC probe or etcd health check errors
- On wall-clock timeout, only trigger recovery if checkCount >= maxChecks
- Add 1s timeout to RBAC probe to prevent hanging on unresponsive API
- Add 5s timeout to systemctl stop to prevent recovery path from stalling
- Extract rbacHookPollDelayStart constant for clarity

Prevents false positive deadlock detection when etcd flaps or probes
error, ensuring close(deadlockDetected) only fires after confirming
the deadlock condition the required number of times.

Related: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@copejon copejon force-pushed the fix-USHIFT-6401 branch 2 times, most recently from b64ecdb to 097404a on April 22, 2026 16:50

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/controllers/kube-apiserver.go (2)

371-371: ⚠️ Potential issue | 🔴 Critical

Use duration constants directly.

Line 371 now overflows because kubeAPIStartupTimeout is already a time.Duration. Line 493 also turns the 2s detector interval into a multi-decade sleep. Use the constants directly and format durations with %s.

Proposed fix
-		err := wait.PollUntilContextTimeout(ctx, time.Second, kubeAPIStartupTimeout*time.Second, true, func(ctx context.Context) (bool, error) {
+		err := wait.PollUntilContextTimeout(ctx, time.Second, kubeAPIStartupTimeout, true, func(ctx context.Context) (bool, error) {
...
-		return fmt.Errorf("RBAC bootstrap hook deadlock detected after %d seconds", rbacHookDeadlockTimeout)
+		return fmt.Errorf("RBAC bootstrap hook deadlock detected after %s", rbacHookDeadlockTimeout)
...
-	case <-time.After(rbacHookCheckInterval * time.Second):
+	case <-time.After(rbacHookCheckInterval):

Verify the stale multipliers are gone:

#!/bin/bash
rg -n 'kubeAPIStartupTimeout\s*\*\s*time\.Second|rbacHookCheckInterval\s*\*\s*time\.Second|after %d seconds' pkg/controllers/kube-apiserver.go

Also applies to: 446-446, 490-494


496-515: ⚠️ Potential issue | 🟠 Major

Don’t skip HTTP not-ready responses.

client-go can return both status != 0 and err != nil for HTTP error responses, so Line 513 skips the RBAC “not ready” signal the detector needs. Only skip transport/timeout failures where no HTTP status was received.

Proposed fix
-		if err != nil {
+		if err != nil && status == 0 {
 			klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
 			continue
 		}

Verify client-go Result status/error behavior in the vendored source:

#!/bin/bash
sed -n '496,515p' pkg/controllers/kube-apiserver.go
rg -n -A8 -B4 'func \(r .*Result\) StatusCode|func \(r .*Result\) Error|func \(r \*Request\) transformResponse' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Line 70: The const block in kube-apiserver.go contains a stray character "1"
immediately before its closing parenthesis which breaks Go syntax; remove that
stray "1" so the const (...) block ends with just the closing ')' and nothing
else, ensuring the const declaration (the const block surrounding the
kube-apiserver constants) compiles cleanly.


📥 Commits

Reviewing files that changed from the base of the PR and between 21b06c5 and 097404a.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

@copejon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@copejon
Contributor Author

copejon commented Apr 22, 2026

/verified by ci

@openshift-ci-robot openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) on Apr 22, 2026
@openshift-ci-robot

@copejon: This PR has been marked as verified by ci.



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • verified: Signifies that the PR passed pre-merge verification criteria.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants